{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zzZbP0LM6m5z"
      },
      "source": [
        "# Add semantic search to Elasticsearch\n",
        "\n",
        "Part 2 and Part 3 of this series showed how to index and search data in txtai. Part 2 indexed and searched a Hugging Face Dataset, Part 3 indexed and searched an external data source. \n",
        "\n",
        "txtai is modular in design, it's components can be individually used. txtai has a similarity function that works on lists of text. This method can be integrated with any external search service, such as a REST API, a SQL query or anything else that returns text search results. \n",
        "\n",
        "In this notebook, we'll take the same Hugging Face Dataset used in Part 2, index it in Elasticsearch and rank the search results using a semantic similarity function from txtai.\n",
        "\n",
        "**Make sure to select a GPU runtime when running this notebook**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xk7t5Jcd6reO"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai`, `datasets` and `Elasticsearch`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0y1UA4-q-YdA"
      },
      "source": [
        "%%capture\n",
        "\n",
        "# Install txtai, datasets and elasticsearch python client\n",
        "!pip install git+https://github.com/neuml/txtai datasets elasticsearch\n",
        "\n",
        "# Download and extract elasticsearch\n",
        "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!chown -R daemon:daemon elasticsearch-7.10.1"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M5c_WxmxCShQ"
      },
      "source": [
        "Start an instance of Elasticsearch directly within this notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3ZfJeWbM6wmj"
      },
      "source": [
        "import os\n",
        "from subprocess import Popen, PIPE, STDOUT\n",
        "\n",
        "# If issues are encountered with this section, ES can be manually started as follows:\n",
        "# ./elasticsearch-7.10.1/bin/elasticsearch\n",
        "\n",
        "# Start and wait for server\n",
        "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n",
        "!sleep 30"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hSWFzkCn61tM"
      },
      "source": [
        "# Load data into Elasticsearch\n",
        "\n",
        "The following block loads the dataset into Elasticsearch."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "So-OBvUT61QD"
      },
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "from elasticsearch import Elasticsearch, helpers\n",
        "\n",
        "# Connect to ES instance\n",
        "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n",
        "\n",
        "# Load HF dataset\n",
        "dataset = load_dataset(\"ag_news\", split=\"train\")[\"text\"][:50000]\n",
        "\n",
        "# Elasticsearch bulk buffer\n",
        "buffer = []\n",
        "rows = 0\n",
        "\n",
        "for x, text in enumerate(dataset):\n",
        "  # Article record\n",
        "  article = {\"_id\": x, \"_index\": \"articles\", \"title\": text}\n",
        "\n",
        "  # Buffer article\n",
        "  buffer.append(article)\n",
        "\n",
        "  # Increment number of articles processed\n",
        "  rows += 1\n",
        "\n",
        "  # Bulk load every 1000 records\n",
        "  if rows % 1000 == 0:\n",
        "    helpers.bulk(es, buffer)\n",
        "    buffer = []\n",
        "\n",
        "    print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n",
        "\n",
        "if buffer:\n",
        "  helpers.bulk(es, buffer)\n",
        "\n",
        "print(\"Total articles inserted: {}\".format(rows))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X5RO-VNwzMAo"
      },
      "source": [
        "# Query data with Elasticsearch\n",
        "\n",
        "Elasticsearch is a token-based search system. Queries and documents are parsed into tokens and the most relevant query-document matches are calculated using a scoring algorithm. The default scoring algorithm is [BM25](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables). Powerful queries can be built using a [rich query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax) and [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl.html). \n",
        "\n",
        "The following section runs a query against Elasticsearch, finds the top 5 matches and returns the corresponding documents associated with each match.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 334
        },
        "id": "ucd9mwSfFTMm",
        "outputId": "1c49323e-1f29-491d-872a-14b5650d09f6"
      },
      "source": [
        "from IPython.display import display, HTML\n",
        "\n",
        "def table(category, query, rows):\n",
        "    html = \"\"\"\n",
        "    <style type='text/css'>\n",
        "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
        "    table {\n",
        "      border-collapse: collapse;\n",
        "      width: 900px;\n",
        "    }\n",
        "    th, td {\n",
        "        border: 1px solid #9e9e9e;\n",
        "        padding: 10px;\n",
        "        font: 15px Oswald;\n",
        "    }\n",
        "    </style>\n",
        "    \"\"\"\n",
        "\n",
        "    html += \"<h3>[%s] %s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>\" % (category, query)\n",
        "    for score, text in rows:\n",
        "        html += \"<tr><td>%.4f</td><td>%s</td></tr>\" % (score, text)\n",
        "    html += \"</table>\"\n",
        "\n",
        "    display(HTML(html))\n",
        "\n",
        "def search(query, limit):\n",
        "  query = {\n",
        "      \"size\": limit,\n",
        "      \"query\": {\n",
        "          \"query_string\": {\"query\": query}\n",
        "      }\n",
        "  }\n",
        "\n",
        "  results = []\n",
        "  for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
        "    source = result[\"_source\"]\n",
        "    results.append((min(result[\"_score\"], 18) / 18, source[\"title\"]))\n",
        "\n",
        "  return results\n",
        "\n",
        "limit = 3\n",
        "query= \"+yankees lose\"\n",
        "table(\"Elasticsearch\", query, search(query, limit))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch] +yankees lose</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.5808</td><td>El Duque adds to gloomy NY forecast The Yankees #39; staff infection has spread to the one man the team can #39;t afford to lose. Orlando Hernandez was scratched from last night #39;s scheduled start because </td></tr><tr><td>0.5688</td><td>Rangers Derail Red Sox The Red Sox lose for the first time in 11 games, falling to the Rangers 8-6 Saturday and missing a chance to pull within 1 1/2 games of the Yankees in the AL East.</td></tr><tr><td>0.5061</td><td>Rout leaves Yanks #39; lead at 3 Royals gain control with 10-run 5th Against a nothing-to-lose team such as the Kansas City Royals, the Yankees #39; manager wanted his team to put down the hammer early and not let baseball #39;s second worst team believe it had a chance.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W1DkpcIob5kt"
      },
      "source": [
        "The table above shows the results for the query `+yankees lose`. This query requires the token `yankees`. The search doesn't understand the semantic meaning of the query. It returns the most relevant results with those two tokens.\n",
        "\n",
        "We can see in this case, the results aren't capturing the meaning of the search. Let's try adding semantic similarity to the search!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hMro47KedzJq"
      },
      "source": [
        "# Ranking search results with txtai\n",
        "\n",
        "txtai has a similarity module that computes the similarity between a query and a list of strings. Of course, txtai can also build a full index as shown in the previous notebooks but in this case we'll just use the ad-hoc similarity function.\n",
        "\n",
        "The code below creates a Similarity instance and defines a ranking function to order search results based on the computed similarity.\n",
        "\n",
        "`ranksearch` queries Elasticsearch for a larger set of results, ranks the results using the similarity instance and returns the top n results. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RUOj5zhFFK8N"
      },
      "source": [
        "%%capture\n",
        "from txtai.pipeline import Similarity\n",
        "\n",
        "def ranksearch(query, limit):\n",
        "  results = [text for _, text in search(query, limit * 10)]\n",
        "  return [(score, results[x]) for x, score in similarity(query, results)][:limit]\n",
        "\n",
        "# Create similarity instance for re-ranking\n",
        "similarity = Similarity(\"valhalla/distilbart-mnli-12-3\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UMFuv5-Hedfc"
      },
      "source": [
        "Now let's re-run the previous search."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 334
        },
        "id": "3jJI9OxU0dZk",
        "outputId": "3233d128-2bf4-4d53-f851-f9dd8f51629e"
      },
      "source": [
        "# Run the search\n",
        "table(\"Elasticsearch + txtai\", query, ranksearch(query, limit))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch + txtai] +yankees lose</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9929</td><td>Ouch! Yankees hit new low INDIANS 22, YANKEES 0---At New York, Omar Vizquel went 6-for-7 to tie the American League record for hits as Cleveland handed the Yankees the largest loss in their history last night.</td></tr><tr><td>0.9874</td><td>Vazquez and Yankees Buckle Early Because Javier Vazquez fizzled while Brad Radke flourished, the Yankees sustained their first regular-season defeat by the Minnesota Twins since 2001.</td></tr><tr><td>0.9542</td><td>Slide of the Yankees: Pinstripes Punished George Steinbrenner watched from his box as his Yankees suffered the most one-sided loss in the franchise's long history.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RXB2PaDZfd8o"
      },
      "source": [
        "The results above do a much better job of finding results semantically similar in meaning to the query. Instead of just finding matches with `yankees` and `lose`, it finds matches where the `yankees lose`. \n",
        "\n",
        "This combination is effective and powerful. It takes advantage of the high performance of Elasticsearch while adding a semantic search capability. We may already have a large Elasticsearch cluster with TBs (or PBs)+ of data and years of engineering investment that solves most use cases. Semantically ranking search results is a practical approach."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XBVL56fsRI86"
      },
      "source": [
        "# More examples\n",
        "\n",
        "Now for some more examples comparing the results from Elasticsearch vs Elasticsearch + txtai."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7IHS38SERRpQ",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "25325d94-ddc9-44f2-f602-d792ffeb0a8c"
      },
      "source": [
        "for query in [\"good news +economy\", \"bad news +economy\"]:\n",
        "  table(\"Elasticsearch\", query, search(query, limit))\n",
        "  table(\"Elasticsearch + txtai\", query, ranksearch(query, limit))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch] good news +economy</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.8756</td><td>Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy.</td></tr><tr><td>0.7379</td><td>China investment slows Good news for officials who are trying to cool an overheated economy; austerity measures to remain. BEIJING (Reuters) - China reported a marked slowdown in investment and money supply growth Monday, but stubbornly </td></tr><tr><td>0.7145</td><td>Spending Rebounds, Good News for Growth  WASHINGTON (Reuters) - U.S. consumer spending rebounded  sharply July, government data showed on Monday, erasing the  disappointment of June and bolstering hopes that the U.S.  economy has recovered from its recent soft spot.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch + txtai] good news +economy</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9996</td><td>Spending Rebounds, Good News for Growth  WASHINGTON (Reuters) - U.S. consumer spending rebounded  sharply in July, the government said on Monday, erasing the  disappointment of June and bolstering hopes that the U.S.  economy has recovered from its recent soft spot.</td></tr><tr><td>0.9996</td><td>Spending Rebounds, Good News for Growth  WASHINGTON (Reuters) - U.S. consumer spending rebounded  sharply July, government data showed on Monday, erasing the  disappointment of June and bolstering hopes that the U.S.  economy has recovered from its recent soft spot.</td></tr><tr><td>0.9993</td><td>Home building surges Housing construction in August jumped to its highest level in five months, a dose of encouraging news for the economy #39;s expansion.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch] bad news +economy</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9228</td><td>Surprise drop US wholesale prices is mixed news for economy (AFP) AFP - A surprise drop in US wholesale prices in August showed inflation apparently in check, but analysts said this was good and bad news for the US economy.</td></tr><tr><td>0.6405</td><td>Field Poll: Californians liking economy Bee Staff Writer. Californians are slowly growing more optimistic about the health of the economy, but a majority still feels the state is in bad economic times, according to a new Field Poll.</td></tr><tr><td>0.6188</td><td>ADB says China should raise rates to cool economy China should raise interest rates to cool the economy and prevent a future buildup of bad loans in the banking system, the Asian Development Bank #39;s (ADB) Bei-jing representative Bruce Murray said.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <style type='text/css'>\n",
              "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
              "    table {\n",
              "      border-collapse: collapse;\n",
              "      width: 900px;\n",
              "    }\n",
              "    th, td {\n",
              "        border: 1px solid #9e9e9e;\n",
              "        padding: 10px;\n",
              "        font: 15px Oswald;\n",
              "    }\n",
              "    </style>\n",
              "    <h3>[Elasticsearch + txtai] bad news +economy</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9977</td><td>Aging society hits Japan #39;s economy Japan #39;s economy will be the most severely affected among industrialized nations by population aging, Kyodo News said Thursday.</td></tr><tr><td>0.9963</td><td>Funds: Fund Mergers Can Hurt Investors (Reuters) Reuters - Mergers and acquisitions have\\played an enormous role in the U.S. economy during the past\\several decades, but sometimes the results have been bad for\\consumers.  Similarly, consolidation in the mutual fund\\business has sometimes hurt fund investors.</td></tr><tr><td>0.9958</td><td>Signs of listless economy persist In a sign of persistent weakness in the US economy, a widely watched measure of business activity declined in August for the third consecutive month.</td></tr></table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h-Bk3KLjZMpF"
      },
      "source": [
        "Once again while Elasticsearch usually returns quality results, occasionally it will match results that aren't semantically relevant. The power of semantic search is that not only will it find direct matches but matches with the same meaning.  "
      ]
    }
  ]
}
